Advertisement insertion points detection for online video advertising

ABSTRACT

Systems and methods for determining insertion points in a first video stream are described. The insertion points are configured for inserting at least one second video into the first video. In accordance with one embodiment, a method for determining the insertion points includes parsing the first video into a plurality of shots. The plurality of shots includes one or more shot boundaries. The method then determines one or more insertion points by balancing a discontinuity metric and an attractiveness metric of each shot boundary.

BACKGROUND

Video advertisements typically have a greater impact on viewers than traditional online text-based advertisements. Internet users frequently stream online source video for viewing. A search engine may have indexed such a source video. The source video may be a video stream from a live camera, a movie, or any videos accessed over a network. If a source video includes a video advertisement clip (a short video, an animation such as a Flash or GIF, still images, etc.), a human being has typically manually inserted the video advertisement clip into the source video. Manually inserting advertisement video clips into source video is a time-consuming and labor-intensive process that does not take into account the real-time nature of interactive user browsing and playback of online source video.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that is further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Described herein are embodiments of various technologies for determining insertion points for a first video stream. In one embodiment, a method for determining the video advertisement insertion points includes parsing a first video into a plurality of shots. The plurality of shots includes one or more shot boundaries. The method then determines one or more insertion points by balancing the discontinuity and attractiveness of each of the one or more shot boundaries. The insertion points are configured for inserting at least one second video into the first video.

In a particular embodiment, the determination of the insertion points includes computing a degree of discontinuity for each of the one or more shot boundaries. Likewise, the determination of the insertion points also includes computing a degree of attractiveness for each of the one or more shot boundaries. The insertion points are then determined based on the degree of discontinuity and the degree of attractiveness of each shot boundary. Once the insertion points are determined, at least one second video is inserted at each of the determined insertion points so that an integrated video stream is formed. In a further embodiment, the integrated video stream is provided to a viewer for playback. In turn, the effectiveness of the insertion points may be assessed based on viewer feedback to a played integrated video stream.

In another embodiment, a computer readable medium for determining insertion points for a first video stream includes computer-executable instructions. The computer-executable instructions, when executed, perform acts that comprise parsing the first video into a plurality of shots. The plurality of shots includes one or more shot boundaries. A degree of discontinuity for each of the one or more shot boundaries is then computed. Likewise, a degree of attractiveness for each of the one or more shot boundaries is also computed. The insertion points are then determined based on the degree of discontinuity and the degree of attractiveness of each shot boundary. The insertion points are configured for inserting at least one second video into the first video.

Once the insertion points are determined, at least one second video is inserted at each of the determined insertion points so that an integrated video stream is formed.

In an additional embodiment, a system for determining insertion points for a first video stream comprises one or more processors. The system also comprises memory allocated for storing a plurality of computer-executable instructions that are executable by the one or more processors. The computer-executable instructions comprise instructions for parsing the first video into a plurality of shots, where the plurality of shots includes one or more shot boundaries, computing a degree of discontinuity for each of the one or more shot boundaries, and computing a degree of attractiveness for each of the one or more shot boundaries. The instructions also enable the determination of the one or more insertion points based on the degree of discontinuity and the degree of attractiveness of each shot boundary. The insertion points are configured for inserting at least one second video into the first video. Finally, the instructions further facilitate the insertion of the at least one second video at each of the determined insertion points to form an integrated video stream.

Other embodiments will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference number in different figures indicates similar or identical items.

FIG. 1 is a simplified block diagram that illustrates an exemplary video advertisement insertion process.

FIG. 2 is a simplified block diagram that illustrates selected components of a video insertion engine that may be implemented in the representative computing environment shown in FIG. 12.

FIG. 3 is a diagram that illustrates exemplary schemes for the determination of video advertisement insertion points.

FIG. 4 depicts an exemplary flow diagram for determining video advertisement insertion points using the representative computing environment shown in FIG. 12.

FIG. 5 depicts an exemplary flow diagram for gauging viewer attention using the representative computing environment shown in FIG. 12.

FIG. 6 is a flow diagram showing an illustrative process for gauging visual attention using the representative computing environment shown in FIG. 12.

FIG. 7 is a flow diagram showing an illustrative process for gauging audio attention using the representative computing environment shown in FIG. 12.

FIG. 8 is a diagram that illustrates the use of the twin-comparison method on consecutive frames in a video segment, such as a shot from a video source, using the representative computing environment shown in FIG. 12.

FIG. 9 is a diagram that illustrates the proximity of video advertisement insertion points to a plurality of parsed shots in a source video.

FIG. 10 is a diagram that illustrates a Left-Right Hidden Markov Model that is included in a best first model merging (BFMM) method.

FIG. 11 is a diagram that illustrates exemplary models of camera motion used for visual attention modeling.

FIG. 12 is a simplified block diagram that illustrates a representative computing environment. The representative environment may be a part of a computing device. Moreover, the representative computing environment may be used to implement the advertisement insertion point determination techniques and mechanisms described herein.

DETAILED DESCRIPTION

This disclosure is directed to systems and methods that facilitate the insertion of video advertisements into source videos. A typical video advertisement, or advertising clip, is a short video that includes text, images, or animation. Video advertisements may be inserted into a source video stream so that as a viewer watches the source video, the viewer is automatically presented with advertising clips at one or more points during playback. For instance, video advertisements may be superimposed into selected frames of a source video. A specific example may be an animation that appears and then disappears on the lower right corner of the video source. In other instances, video advertisements may be displayed in separate streams beside the source video stream. For example, a video advertisement may be presented in a separate viewing area during at least some duration of the source video playback. By presenting video advertisements simultaneously with the source video, the likelihood that the video advertisements will be noticed by the viewer may be enhanced.

The systems and methods in accordance with this disclosure determine one or more positions in the timeline of a source video stream where video advertisements may be inserted. These timeline positions may also be referred to as insertion points. According to various embodiments, the one or more insertion points may be positioned so that the impact of the inserted advertising clips on the viewer is maximized. The determination of video advertisement insertion points in a video source stream is described below with reference to FIGS. 1-12.

Exemplary Insertion Point Determination Concept

FIG. 1 shows an exemplary video advertisement insertion system 100. The video advertisement insertion system 100 enables content providers 102 to provide video sources 104. The content providers 102 may include anyone who owns video content and is willing to disseminate such video content to the general public. For example, the content providers 102 may include professional as well as amateur artists. The video sources 104 are generally machine-readable works that contain a plurality of images, such as movies, video clips, homemade videos, etc.

Advertisers 106 may produce video advertisements 108. The video advertisements 108 are generally one or more images intended to generate viewer interest in particular goods, services, or points of view. In many instances, a video advertisement 108 may be a video clip. The video clip may be approximately 10-30 seconds in duration. In the exemplary system 100, the video sources 104 and the video advertisements 108 may be transferred to an advertising service 110 via one or more networks 112. The one or more networks 112 may include wide-area networks (WANs), local area networks (LANs), or other network architectures.

The advertising service 110 is generally configured to integrate the video sources 104 with the video advertisements 108. Specifically, the advertising service 110 may use the video insertion engine 114 to match portions of the video source 104 with video advertisements 108. According to various implementations, the video insertion engine 114 may determine one or more insertion points 116 in the timeline of the video source 104. The video insertion engine 114 may then insert one or more video advertisements 118 at the insertion points 116. The integration of the video source 104 and the one or more video advertisements 118 produces an integrated video 120. As further described below, the determination of locations for the insertion points 116 may be based on assessing the “discontinuity” and the “attractiveness” of the boundaries between the video segments, or shots, which make up the video source 104.

FIG. 2 illustrates selected components of one example of the video insertion engine 114. The video insertion engine 114 may include computer-program instructions that are executed by a computing device such as a personal computer. Program instructions may include routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types. However, the video insertion engine 114 may also be implemented in hardware.

The video insertion engine 114 may include a shot parser 202. The shot parser 202 can be configured to parse a video, such as video source 104, into video segments, or shots. Specifically, the shot parser 202 may employ a boundary determination engine 204 to first divide the video source into shots. Subsequently, the boundary determination engine 204 may further ascertain the breaks between the shots, that is, the shot boundaries, to serve as potential video advertisement insertion points. Some of the potential advertisement insertion points may end up being actual advertisement insertion points 116, as shown in FIG. 1.

The example video insertion engine 114 may also include a boundary analyzer 206. As shown in FIG. 2, the boundary analyzer 206 may further include a discontinuity analyzer 208 and an attractiveness analyzer 210. The discontinuity analyzer 208 may be configured to analyze each shot boundary for “discontinuity,” which may further include “content discontinuity” and “semantic discontinuity.” As described further below, “content discontinuity” is a measurement of the visual and/or audio discontinuity, as perceived by viewers. “Semantic discontinuity,” on the other hand, is the discontinuity as analyzed by a mental process of the viewers. To put it another way, the “discontinuity” at a particular shot boundary measures the dissimilarity of the pair of shots that are adjacent to the shot boundary.

Moreover, the attractiveness analyzer 210 of the boundary analyzer 206 may be configured to compute a degree of “attractiveness” for each shot boundary. In general, the degree of “attractiveness” is a measurement of the ability of the shot boundary to attract the attention of viewers. As further described below, the attractiveness analyzer 210 may use a plurality of user attention models, or mathematical algorithms, to quantify the attractiveness of a particular shot boundary. The “discontinuity” and “attractiveness” of one or more shot boundaries in a source video are then analyzed by an insertion point generator 212.

The insertion point generator 212 may detect the appropriate locations in the timeline of the source video to serve as insertion points based on the discontinuity measurements and attractiveness measurements. As further described below, in some embodiments, the insertion point generator 212 may include an evaluator 214 that is configured to adjust the weights assigned to the “discontinuity” and “attractiveness” measurements of the shot boundaries to optimize the placement of the insertion points 116.

The video insertion engine 114 may further include an advertisement embedder 216. The advertisement embedder 216 may be configured to insert one or more video advertisements, such as the video advertisements 118, at the detected insertion points 116. According to various implementations, the advertisement embedder 216 may insert the one or more video advertisements directly into the source video stream at the insertion points 116. Alternatively, the advertisement embedder 216 may insert the one or more video advertisements by overlaying or superimposing the video advertisements onto one or more frames of the video source, such as video source 104, at the insertion points 116. In some instances, the advertisement embedder 216 may insert the one or more video advertisements by initiating their display as separate streams concurrently with the video source at the insertion points 116.

FIG. 3 illustrates exemplary schemes for the determination of insertion point locations in the timeline of a source video. Specifically, the schemes weigh the “attractiveness” and “discontinuity” measurements of the various shots in the video source. Based on these measurements, insertion point locations may be determined by balancing “attractiveness” and “intrusiveness” considerations.

According to various implementations, the “intrusiveness” of a video advertisement insertion point may be defined as the interruptive effect of the video advertisement on a viewer who is watching a playback of the video source stream. For example, a video advertisement in the form of an “embedded” animation is likely to be intrusive if it appears during a dramatic point (e.g., a shot or a scene) in the story being presented by the source video. Such presentation of the video advertisement may distract the viewer and detract from the viewer's ability to derive enjoyment from the story. However, the likelihood that the viewer will notice the video advertisement may be increased. In this way, as long as the intrusiveness of the video advertisement insertion point does not exceed viewer tolerance, the owner of the video advertisement may derive increased benefit. Thus, more “intrusive” video advertisement insertion points weigh the benefit to advertisers more heavily than the benefit to video source viewers.

Conversely, if an insertion point is configured so that the video advertisement is displayed during a relatively uninteresting part of a story presented by the video source, the viewer is likely to feel less interrupted. Uninteresting parts of the video source may include the ends of a story or scene, or the finish of a shot. Because these terminations are likely to be natural breaking points, the viewer may perceive the video advertisements inserted at these points as less intrusive. Consequently, less “intrusive” video advertisement insertion points place a greater emphasis on the benefit to viewers than on the benefit to advertisers.

According to various embodiments, the “attractiveness” of a video advertisement insertion point is dependent upon the “attractiveness” of an associated video segment. Further, the “attractiveness” of a particular video segment may be approximated by the degree to which the content of the video segment attracts viewer attention. For example, images that zoom in/out of view, as well as images that depict human faces, are generally ideal for attracting viewer attention.

Thus, if a video advertisement is shown close in time to a more “attractive” video segment, the video advertisement is likely to be perceived by the viewer as more “attractive.” Conversely, if a video advertisement is shown close in time to a video segment that is not as “attractive,” such as a relatively boring or uninteresting segment, the viewer is likely to deem the video advertisement as less “attractive.” Because more “attractive” video advertisement insertion points are generally closer in time to more “attractive” video segments, having a more “attractive” insertion point may be considered to place a greater emphasis on the benefit to advertisers. On the other hand, because less “attractive” video advertisement insertion points are proximate to less “attractive” video segments, having a less “attractive” video advertisement insertion point may be considered to weigh the benefit to the viewers more heavily than the benefit to advertisers.

As shown in FIG. 3, if the attractiveness of a video segment is denoted by A, and the discontinuity of the video segment is denoted by D, there are four combination schemes 302-308 that balance A and D during the detection of video advertisement insertion points. Additionally, α and β represent two parameters (non-negative real numbers), each of which can be set to 3. The measurements of A and D, as well as the combination of attractiveness and discontinuity represented by A and D, will be described below. According to various embodiments, the four combination schemes are configured to provide video advertisement insertion locations based on the desired “attractiveness” to “intrusiveness” proportions.

Exemplary Processes

FIGS. 4-7 illustrate exemplary processes for the determination of video advertisement insertion points. The exemplary processes in FIGS. 4-7 are illustrated as a collection of blocks in a logical flow diagram, which represents a sequence of operations that can be implemented in hardware, software, or a combination thereof. In the context of software, the blocks represent computer-executable instructions that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks can be combined in any order and/or in parallel to implement the process. For discussion purposes, the processes are described with reference to the video insertion engine 114 of FIG. 1, although they may be implemented in other system architectures.

FIG. 4 shows a process 400 for determining video advertisement insertion points. At block 402, the boundary determination engine 204 of the shot parser 202 may implement a pre-processing step to parse a source video V into N_s shots.

According to some embodiments, the shot parser 202 may parse the source video V into N_s shots based on the visual details in each of the shots. In these embodiments, the parsing of the source video into a plurality of shots may employ a pair-wise comparison method to detect a qualitative change between two frames. Specifically, the pair-wise comparison method may include the comparison of corresponding pixels in the two frames to determine how many pixels have changed. In one implementation, a pixel is determined to be changed if the difference between its intensity values in the two frames exceeds a given threshold t. The comparison metric can be represented as a binary function DP_i(k,l) over the domain of two-dimensional pixel coordinates (k,l), where the subscript i denotes the index of the frame being compared with its successor. If P_i(k,l) denotes the intensity value of the pixel at coordinates (k,l) in frame i, then DP_i(k,l) may be defined as follows:

$$DP_i(k,l) = \begin{cases} 1 & \text{if } \left| P_i(k,l) - P_{i+1}(k,l) \right| > t \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

The pair-wise segmentation comparison method counts the number of pixels changed from one frame to the next according to the comparison metric. A segment boundary is declared if more than a predetermined percentage of the total number of pixels (given as a threshold T) has changed. Since the total number of pixels in a frame of dimensions M by N is M*N, this condition may be represented by the following inequality:

$$\frac{\sum_{k,l=1}^{M,N} DP_i(k,l)}{M \times N} \times 100 > T \qquad (2)$$
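As a concrete illustration of Equations (1) and (2), the following is a minimal Python sketch of the pair-wise comparison method. The function name, the use of grayscale frames, and the threshold values are illustrative assumptions rather than details from this disclosure; the absolute difference realizes the per-pixel test of Equation (1).

    import numpy as np

    def pairwise_boundary(frame_a, frame_b, t=20, T=50.0):
        # Equation (1): a pixel counts as changed when its intensity
        # difference across the two frames exceeds the threshold t.
        diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
        changed = diff > t  # binary map DP_i(k, l)

        # Equation (2): declare a boundary when the percentage of
        # changed pixels among all M*N pixels exceeds T.
        percent_changed = 100.0 * changed.sum() / changed.size
        return percent_changed > T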

In particular embodiments, the boundary determination engine 204 may employ a smoothing filter before the comparison of each pixel. The use of a smoothing filter may reduce or eliminate the sensitivity of this comparison method to camera panning. The large number of objects moving across successive frames, as associated with camera panning, may cause the comparison metric to judge that a large number of pixels have changed even if the pan entails a shift of only a few pixels. The smoothing filter may serve to reduce this effect by replacing the value of each pixel in a frame with the mean value of its nearest neighbors.

In other embodiments, the parsing of the source video into a plurality of shots may make use of a likelihood ratio method. In contrast to the pair-wise comparison method described above, the likelihood ratio method may compare corresponding regions (blocks) in two successive frames on the basis of second-order statistical characteristics of their intensity values. For example, if m_i and m_{i+1} denote the mean intensity values for a given region in two consecutive frames, and S_i and S_{i+1} denote the corresponding variances, the following formula computes the likelihood ratio and determines whether or not it exceeds a given threshold t:

$$\frac{\left[ \frac{S_i + S_{i+1}}{2} + \left( \frac{m_i - m_{i+1}}{2} \right)^2 \right]^2}{S_i \cdot S_{i+1}} > t \qquad (3)$$

By using this formula, breaks between shots may be detected by first partitioning the frames into a set of sample areas. A break between shots may then be declared whenever the total number of sample areas whose likelihood ratios exceed the threshold t is greater than a predetermined amount. In one implementation, this predetermined amount is dependent on how a frame is partitioned.
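A minimal sketch of the likelihood ratio test of Equation (3) is shown below, assuming grayscale frames partitioned into square sample areas; the block size, thresholds, and small epsilon guard are illustrative choices, not values from the text.

    import numpy as np

    def likelihood_ratio_boundary(frame_a, frame_b, t=3.0, block=16, min_blocks=10):
        M, N = frame_a.shape
        exceeded = 0
        for r in range(0, M - block + 1, block):
            for c in range(0, N - block + 1, block):
                a = frame_a[r:r + block, c:c + block].astype(np.float64)
                b = frame_b[r:r + block, c:c + block].astype(np.float64)
                m_i, m_j = a.mean(), b.mean()
                s_i, s_j = a.var() + 1e-9, b.var() + 1e-9  # guard divide-by-zero
                # Equation (3): likelihood ratio for this sample area.
                ratio = ((s_i + s_j) / 2.0 + ((m_i - m_j) / 2.0) ** 2) ** 2 / (s_i * s_j)
                if ratio > t:
                    exceeded += 1
        # Declare a break when enough sample areas exceed the threshold.
        return exceeded > min_blocks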

In alternative embodiments, the boundary determination engine 204 may parse the source video into a plurality of shots using an intensity level histogram method. Specifically, the boundary determination engine 204 may use a histogram algorithm to develop and compare the intensity level histograms of complete images on successive frames. The principle behind this algorithm is that two frames having an unchanging background and unchanging objects will show little difference in their respective histograms. The histogram comparison algorithm may exhibit less sensitivity to object motion as it ignores spatial changes in a frame.

For instance, if H_i(j) denotes the histogram value for the i-th frame, where j is one of the G possible grey levels, then the difference between the i-th frame and its successor is given by the following formula:

$$SD_i = \sum_{j=1}^{G} \left| H_i(j) - H_{i+1}(j) \right| \qquad (4)$$

In such an instance, if the overall difference SD_i is larger than a given threshold T, a shot boundary may be declared. To select a suitable threshold T, SD_i can be normalized by dividing it by the product of G and M*N. As described above, M*N represents the total number of pixels in a frame of dimensions M by N. Additionally, it will be appreciated that the number of histogram bins used for denoting the histogram values may be selected on the basis of the available grey-level resolution and the desired computation time. In additional embodiments, the boundary determination engine 204 may use a twin-comparison method to detect gradual transitions between shots in a video source. The twin-comparison method is shown in FIG. 8.
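Before turning to the twin-comparison method, the histogram comparison of Equation (4) may be sketched as follows; the bin count G, the normalization by G*M*N suggested above, and the threshold value are illustrative assumptions.

    import numpy as np

    def histogram_boundary(frame_a, frame_b, G=64, T=0.005):
        # G-bin grey-level histograms of the two complete frames.
        h_a, _ = np.histogram(frame_a, bins=G, range=(0, 256))
        h_b, _ = np.histogram(frame_b, bins=G, range=(0, 256))

        # Equation (4): sum of absolute bin-wise differences.
        sd = np.abs(h_a - h_b).sum()

        # Normalize by G * M * N, as suggested in the text, so that a
        # single threshold T works across frame sizes.
        return sd / float(G * frame_a.size) > T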

FIG. 8 illustrates the use of the twin-comparison method on consecutive frames in a video segment, such as a shot from a video source. As shown in graph 802, the twin-comparison method uses two cutoff thresholds: T_b and T_s. T_b represents a high threshold. The high threshold may be obtained using the likelihood ratio method described above. T_s represents a low threshold. T_s may be determined via the comparison of consecutive frames using a difference metric, such as the difference metric in Equation 4.

First, wherever the difference value between consecutive frames exceeds the threshold T_b, the twin-comparison method may recognize the location of the frame as a shot break. For example, the location F_b may be recognized as a shot break because it has a difference value that exceeds the threshold T_b.

Second, the twin-comparison method may also be capable of detecting differences that are smaller than T_b but larger than T_s. Any frame that exhibits such a difference value is marked as the potential start (F_s) of a gradual transition. As illustrated in FIG. 8, the frame having a difference value between the thresholds T_b and T_s is then compared to subsequent frames in what is known as an accumulated comparison. The accumulated comparison of consecutive frames, as defined by the difference metric SD′_{p,q}, is shown in graph 804. During a gradual transition, the difference value will normally increase. Further, the end frame (F_e) of the transition may be detected when the difference between consecutive frames decreases to less than T_s while the accumulated comparison has increased to a value larger than T_b.

Further, the accumulated comparison value is only computed while the difference between consecutive frames exceeds T_s. If the consecutive difference value drops below T_s before the accumulated comparison value exceeds T_b, then the potential starting point is dropped and the search continues for other gradual transitions. By being configured to detect the simultaneous satisfaction of two distinct threshold conditions, the twin-comparison method can detect gradual transitions as well as ordinary breaks between shots.
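The twin-comparison logic described above may be sketched as follows, reusing the normalized histogram difference as the difference metric; the threshold values are illustrative, and realizing the accumulated comparison as a comparison of the candidate start frame against the current frame is an assumed interpretation.

    import numpy as np

    def twin_comparison(frames, T_b=0.02, T_s=0.005, G=64):
        def diff(a, b):
            # Normalized histogram difference in the spirit of Equation (4).
            h_a, _ = np.histogram(a, bins=G, range=(0, 256))
            h_b, _ = np.histogram(b, bins=G, range=(0, 256))
            return np.abs(h_a - h_b).sum() / float(G * a.size)

        breaks, start, accumulated = [], None, 0.0
        for i in range(len(frames) - 1):
            d = diff(frames[i], frames[i + 1])
            if d > T_b:
                breaks.append(("cut", i, i + 1))      # ordinary shot break
                start, accumulated = None, 0.0
            elif d > T_s:
                if start is None:
                    start = i                         # potential start F_s
                # Accumulated comparison: start frame vs. current frame.
                accumulated = diff(frames[start], frames[i + 1])
            else:
                # Difference fell below T_s: declare a gradual transition
                # only if the accumulated comparison exceeded T_b.
                if start is not None and accumulated > T_b:
                    breaks.append(("gradual", start, i))
                start, accumulated = None, 0.0
        return breaks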

In other embodiments, the shot parser 202 may parse the source video V into N_s shots based on the audio stream in the source video. The audio stream is the audio signal that corresponds to the visual images of the source video. In order to parse the source video into shots, the shot parser 202 may be configured to first select features that reflect optimal temporal and spectral characteristics of the audio stream. These optimal features may include: (1) features that are selected using mel-frequency cepstral coefficients (MFCCs); and (2) perceptual features. These features may then be combined into one feature vector after normalization.

Before feature extraction, the shot parser 202 may convert the audio stream associated with the source video into a general format. For example, the shot parser 202 may convert the audio stream into an 8 KHz, 16-bit, mono-channel format. The converted audio stream may be pre-emphasized to equalize any inherent spectral tilt. In one implementation, the audio stream of the source video may then be further divided into non-overlapping 25 ms-long frames for feature extraction.

Subsequently, the shot parser 202 may use eight-order MFCCs to select some of the features of the video source audio stream. The MFCCs may be expressed as:

$$c_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} \left( \log S_k \right) \cos\left[ n \left( k - 0.5 \right) \frac{\pi}{K} \right], \quad n = 1, 2, \ldots, L \qquad (5)$$

where K is the number of band-pass filters, S_k is the Mel-weighted spectrum after passing the k-th triangular band-pass filter, and L is the order of the cepstrum. Since eight-order MFCCs are implemented in this embodiment, L=8.
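Assuming the triangular band-pass filter outputs S_k have already been computed for one 25 ms frame, Equation (5) reduces to a cosine transform of the log filter energies; the following sketch is one such realization, with the epsilon guard being an illustrative addition.

    import numpy as np

    def mfcc_from_filterbank(S, L=8):
        # S: length-K array of Mel-weighted band-pass filter outputs.
        K = len(S)
        log_S = np.log(S + 1e-12)              # guard against log(0)
        n = np.arange(1, L + 1)[:, None]       # cepstral index, column
        k = np.arange(1, K + 1)[None, :]       # filter index, row
        # Equation (5): c_n = sqrt(2/K) * sum_k log(S_k) cos[n(k-0.5)pi/K].
        basis = np.cos(n * (k - 0.5) * np.pi / K)
        return np.sqrt(2.0 / K) * (basis * log_S).sum(axis=1)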

As described above, perceptual features may also reflect optimal temporal and spectral characteristics of the audio stream. These perceptual features may include: (1) zero crossing rate (ZCR); (2) short time energy (STE); (3) sub-band power distribution; (4) brightness, bandwidth, and spectrum flux (SF); (5) band periodicity (BP); and (6) noise frame ratio (NFR).

Zero-crossing rate (ZCR) is especially suited for discriminating between speech and music. Specifically, speech signals are typically composed of alternating voiced sounds and unvoiced sounds at the syllable rate, while music signals usually do not have this kind of structure. Hence, the variation of the zero-crossing rate for speech signals will generally be greater than that for music signals. ZCR is defined as the number of time-domain zero-crossings within a frame. In other words, the ZCR is a measurement of the frequency content of a signal:

$$ZCR = \frac{1}{2(N-1)} \sum_{m=1}^{N-1} \left| \operatorname{sgn}\left[ x(m+1) \right] - \operatorname{sgn}\left[ x(m) \right] \right| \qquad (6)$$

where sgn[·] is a sign function and x(m) is the discrete audio signal, m = 1 . . . N.
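Equation (6) may be realized directly on one audio frame as follows; treating a zero sample as having sign 0 is a simplifying assumption of this sketch.

    import numpy as np

    def zero_crossing_rate(x):
        # Equation (6): half the mean absolute difference of signs.
        signs = np.sign(x)
        return np.abs(np.diff(signs)).sum() / (2.0 * (len(x) - 1))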

Likewise, short time energy (STE) is the spectrum power of the audio signal associated with a particular frame in the source video. The shot parser 202 may use the STE algorithm to discriminate speech from music. STE may be expressed as:

$$STE = \log\left( \int_{0}^{w_0} \left| F(w) \right|^2 \, dw \right) \qquad (7)$$

where F(w) denotes the Fast Fourier Transform (FFT) coefficients, |F(w)|² is the power at the frequency w, and w₀ is the half sampling frequency. The frequency spectrum may be divided into four sub-bands with intervals

$$\left[ 0, \frac{w_0}{8} \right], \left[ \frac{w_0}{8}, \frac{w_0}{4} \right], \left[ \frac{w_0}{4}, \frac{w_0}{2} \right], \text{ and } \left[ \frac{w_0}{2}, w_0 \right].$$

Additionally, the ratio between the sub-band power and the total power in a frame is defined as:

$$D = \frac{1}{STE} \int_{L_j}^{H_j} \left| F(w) \right|^2 \, dw \qquad (8)$$

where L_j and H_j are the lower and upper bounds of sub-band j, respectively.
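Using an FFT in place of the integrals, Equations (7) and (8) may be sketched as follows for one frame; interpreting the sub-band ratio of Equation (8) as sub-band power over total power is an assumption consistent with the surrounding text.

    import numpy as np

    def short_time_energy_and_subbands(x):
        spectrum = np.abs(np.fft.rfft(x)) ** 2     # |F(w)|^2 up to w0
        total = spectrum.sum() + 1e-12
        ste = np.log(total)                        # Equation (7)

        # Four sub-bands [0, w0/8], [w0/8, w0/4], [w0/4, w0/2], [w0/2, w0].
        n = len(spectrum)
        edges = [0, n // 8, n // 4, n // 2, n]
        ratios = [spectrum[lo:hi].sum() / total    # Equation (8)
                  for lo, hi in zip(edges[:-1], edges[1:])]
        return ste, ratios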

Brightness and bandwidth represent the frequency characteristics. Specifically, the brightness is the frequency centroid of the audio signal spectrum associated with a frame. Brightness can be defined as:

$$w_c = \frac{\int_{0}^{w_0} w \left| F(w) \right|^2 \, dw}{\int_{0}^{w_0} \left| F(w) \right|^2 \, dw} \qquad (9)$$

Bandwidth is the square root of the power-weighted average of the squared difference between the spectral components and the frequency centroid:

$$B = \sqrt{\frac{\int_{0}^{w_0} \left( w - w_c \right)^2 \left| F(w) \right|^2 \, dw}{\int_{0}^{w_0} \left| F(w) \right|^2 \, dw}} \qquad (10)$$

Brightness and bandwidth may be extracted from the audio signal associated with each frame in the source video. The shot parser 202 may then compute the means and standard deviations for the audio signals associated with all the frames in the source video. In turn, the means and standard deviations represent a perceptual feature of the source video.
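Equations (9) and (10) translate to a power-weighted mean and standard deviation of frequency, as in the following sketch; the 8 KHz sampling rate follows the conversion step described earlier.

    import numpy as np

    def brightness_and_bandwidth(x, sr=8000):
        power = np.abs(np.fft.rfft(x)) ** 2            # |F(w)|^2
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)    # w, in Hz
        total = power.sum() + 1e-12

        w_c = (freqs * power).sum() / total            # Equation (9)
        bw = np.sqrt(((freqs - w_c) ** 2 * power).sum() / total)  # Equation (10)
        return w_c, bw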

Spectrum flux (SF) is the average variation value of the spectra of the audio signals associated with two adjacent frames in a shot. In general, speech signals are composed of alternating voiced sounds and unvoiced sounds at a syllable rate, while music signals do not have this kind of structure. Hence, the SF of a speech signal is generally greater than the SF of a music signal. SF may be especially useful for discriminating some strong-periodicity environment sounds, such as a tone signal, from music signals. SF may be expressed as:

$$SF = \frac{1}{(N-1)(K-1)} \sum_{n=1}^{N-1} \sum_{k=1}^{K-1} \left[ \log\left( A(n,k) + \delta \right) - \log\left( A(n-1,k) + \delta \right) \right]^2 \qquad (11)$$

where

$$A(n,k) = \left| \sum_{m=-\infty}^{\infty} x(m) \, w(nL - m) \, e^{-j \frac{2\pi}{L} k m} \right| \qquad (12)$$

and x(m) is the input discrete audio signal, w(m) is the window function, L is the window length, K is the order of the discrete Fourier transform (DFT), δ is a very small value to avoid calculation overflow, and N is the total number of frames in the source video.
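Given the framed audio as a two-dimensional array (the windowing of Equation (12) being assumed to have been applied by the caller), Equation (11) may be sketched as:

    import numpy as np

    def spectrum_flux(frames, K=256, delta=1e-6):
        # frames: (N, frame_length) array of windowed audio frames.
        N = len(frames)
        # Magnitude spectra standing in for A(n, k) of Equation (12);
        # bins 1..K-1 are kept to match the sums in Equation (11).
        A = np.abs(np.fft.fft(frames, n=K, axis=1))[:, 1:K]
        logA = np.log(A + delta)
        # Equation (11): mean squared log-spectrum difference between
        # adjacent frames.
        return np.sum((logA[1:] - logA[:-1]) ** 2) / ((N - 1) * (K - 1))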

Band periodicity (BP) is the periodicity of each sub-band. BP can be derived from sub-band correlation analysis. In general, music band periodicities are much higher than those of environment sound. Accordingly, band periodicity is an effective feature in music and environment sound discrimination. In one implementation, four sub-bands may be selected with intervals

$$\left[ 0, \frac{w_0}{8} \right], \left[ \frac{w_0}{8}, \frac{w_0}{4} \right], \left[ \frac{w_0}{4}, \frac{w_0}{2} \right], \text{ and } \left[ \frac{w_0}{2}, w_0 \right].$$

The periodicity property of each sub-band is represented by the maximum local peak of the normalized correlation function. For example, the BP of a sine wave may be represented by 1, and the BP of white noise may be represented by 0. The normalized correlation function is calculated from a current frame and a previous frame:

$$r_{i,j}(k) = \frac{\sum_{m=0}^{M-1} s_i(m-k) \, s_i(m)}{\sqrt{\sum_{m=0}^{M-1} s_i^2(m-k)} \sqrt{\sum_{m=0}^{M-1} s_i^2(m)}} \qquad (13)$$

where r_{i,j}(k) is the normalized correlation function, i is the band index, and j is the frame index. s_i(n) is the i-th sub-band digital signal of the current frame and the previous frame; when n<0, the data is from the previous frame. Otherwise, the data is from the current frame. M is the total length of a frame.

Accordingly, the maximum local peak may be denoted as r_{i,j}(k_p), where k_p is the index of the maximum local peak. In other words, r_{i,j}(k_p) is the band periodicity of the i-th sub-band of the j-th frame. Thus, the band periodicity may be calculated as:

$$bp_i = \frac{1}{N} \sum_{j=1}^{N} r_{i,j}(k_p), \quad i = 1, \ldots, 4 \qquad (14)$$

where bp_i is the band periodicity of the i-th sub-band, and N is the total number of frames in the source video.
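For one sub-band, Equations (13) and (14) may be sketched as follows; the brute-force search over all lags k, and the use of the global maximum as the maximum local peak, are simplifying assumptions.

    import numpy as np

    def band_periodicity(sub_band_frames):
        # sub_band_frames: list of equal-length 1-D arrays holding the
        # band-pass-filtered signal, one entry per frame.
        peaks = []
        for prev, cur in zip(sub_band_frames[:-1], sub_band_frames[1:]):
            M = len(cur)
            joined = np.concatenate([prev, cur])   # s_i(n); n < 0 taken from previous frame
            r = []
            for k in range(1, M):
                a = joined[M - k:2 * M - k]        # s_i(m - k), m = 0..M-1
                b = cur                            # s_i(m)
                denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
                r.append((a * b).sum() / denom)    # Equation (13)
            peaks.append(max(r))                   # r_{i,j}(k_p)
        return float(np.mean(peaks))               # Equation (14)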

The shot parser 202 may use the noise frame ratio (NFR) to discriminate environment sound from music and speech, as well as to discriminate noisy speech from pure speech and music more accurately. NFR is defined as the ratio of noise frames to non-noise frames in a given shot. A frame is considered a noise frame if the maximum local peak of its normalized correlation function is lower than a pre-set threshold. In general, the NFR value of noise-like environment sound is higher than that of music, because it contains many more noise frames.

Finally, the shot parser 202 may concatenate the MFCC features and the perceptual features into a combined vector. In order to do so, the shot parser 202 may normalize each feature to make their scales similar. The normalization is processed as x′_i = (x_i − μ_i)/σ_i, where x_i is the i-th feature component and μ_i and σ_i are the corresponding mean and standard deviation. The normalized feature vector is the final representation of the audio stream of the source video.

Once the shot parser 202 has determined a final representation of the audio stream of the video source, the shot parser 202 may be configured to employ support vector machines (SVMs) to segment the source video into shots based on the final representation of the audio stream of the source video. Support vector machines (SVMs) are a set of related supervised learning methods used for classification.

In one implementation, the audio stream of the source video may be classified into five classes. These classes may include: (1) silence; (2) music; (3) background sound; (4) pure speech; and (5) non-pure speech. In turn, non-pure speech may include (1) speech with music, and (2) speech with noise. Initially, the shot parser 202 may classify the audio stream into silent and non-silent segments depending on the energy and zero-crossing rate information. For example, a portion of the audio stream may be marked as silence if its energy and zero-crossing rate are less than predefined thresholds.

Subsequently, a kernel SVM with a Gaussian Radial Basis function may be used to further classify the non-silent portions of the audio stream in a binary tree process. The kernel SVM may be derived from a hyper-plane classifier, which is represented by the equation:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1} \bar{\alpha}_i y_i \, x_i \cdot x + \bar{b} \right) \qquad (15)$$

where α and b are parameters for the classifier, and the solution vector x_i with non-zero α_i is called a Support Vector. The kernel SVM is obtained by replacing the inner product x·y with a kernel function K(x,y), and then constructing an optimal separating hyper-plane in a mapped space. Accordingly, the kernel SVM may be represented as:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \bar{\alpha}_i y_i K\left( x_i, x \right) + \bar{b} \right) \qquad (16)$$

Moreover, the Gaussian Radial Basis function may be added to the kernel SVM by the equation

$$K(x,y) = \exp\left( -\frac{\left\| x - y \right\|^2}{2\sigma^2} \right).$$

According to various embodiments, the use of kernel SVMs with the Gaussian Radial Basis function to segment the audio stream, and thus the video source corresponding to the audio stream, into shots may be carried out in several steps. First, the audio stream is classified into speech and non-speech segments by a kernel SVM. Then, the non-speech segment may be further classified into shots that contain music and background sound by a second kernel SVM. Likewise, the speech segment may be further classified into pure speech and non-pure speech shots by a third kernel SVM.
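A minimal sketch of this binary-tree classification, using the RBF-kernel support vector classifier from scikit-learn, is given below. The function names, label strings, and training interface are illustrative assumptions; only the tree structure (speech vs. non-speech, then music vs. background sound and pure vs. non-pure speech) follows the text.

    import numpy as np
    from sklearn.svm import SVC

    def train_tree(X, labels):
        # X: (n, d) array of normalized feature vectors for non-silent segments.
        # labels: strings in {"pure speech", "non-pure speech",
        #                     "music", "background sound"}.
        labels = np.asarray(labels)
        speech = np.isin(labels, ["pure speech", "non-pure speech"])

        svm_speech = SVC(kernel="rbf", gamma="scale").fit(X, speech)
        svm_music = SVC(kernel="rbf", gamma="scale").fit(
            X[~speech], labels[~speech] == "music")
        svm_pure = SVC(kernel="rbf", gamma="scale").fit(
            X[speech], labels[speech] == "pure speech")
        return svm_speech, svm_music, svm_pure

    def classify(x, svm_speech, svm_music, svm_pure):
        # Walk the binary tree: the first SVM separates speech from
        # non-speech; the second and third refine each branch.
        x = np.asarray(x).reshape(1, -1)
        if svm_speech.predict(x)[0]:
            return "pure speech" if svm_pure.predict(x)[0] else "non-pure speech"
        return "music" if svm_music.predict(x)[0] else "background sound"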

It will be appreciated that while some methods for detecting breaks between shots in a video source have been illustrated and described, the boundary determination engine 204 may carry out the detection of breaks using other methods. Accordingly, the exemplary methods discussed above are intended to be illustrative rather than limiting.

Once the source video V is parsed into N_s shots using one of the methods described above, s_i may be used to denote the i-th shot in V. Accordingly, V = {s_i}, where i = 1, . . . , N_s. As a result, the total number of candidate insertion points may be represented by (N_s + 1). The relationships between the candidate insertion points and the parsed shots are illustrated in FIG. 9.

FIG. 9 illustrates the proximity of the video advertisement insertion points to the parsed shots s_{Ns} in the source video V. As shown, the insertion points 902-910 are distributed between the shots 912-918. According to various embodiments, the candidate insertion points correspond to the shot breaks between the shots.

At block 404, the discontinuity analyzer 208 of the boundary analyzer 206 may determine the overall discontinuity of each shot in the source video. Specifically, in some embodiments, the overall discontinuity of each shot may include a “content discontinuity.” Content discontinuity measures the visual and/or audio perception-based discontinuity, and may be obtained using a best first model merging (BFMM) method.

Specifically, given the set of shots {s_i} (i = 1, . . . , N_s) in a source video, the discontinuity analyzer 208 may use a best first model merging (BFMM) method to merge the shots into a video sequence. As described above, the shots may be obtained by parsing a source video based on the visual details of the source video. Alternatively, the shots may be obtained by segmentation of the source video based on the audio stream that corresponds to the source video. Accordingly, best first model merging (BFMM) may be used to merge the shots in sequence based on factors such as color similarity between shots, audio similarity between shots, or a combination of these factors.

The merging of the shots may be treated as an ordered clustering of a group of consecutive, time-constrained shots. In one implementation, the shots of the video source may be regarded as a hidden state variable. Accordingly, the clustering problem may then treat the boundaries between shots as the probabilities of transitioning to the next states. According to various embodiments, probabilistic clustering may be used to carry out BFMM. Probabilistic clustering creates a conditional density model of the data, where the probability of generating a shot depends conditionally upon the cluster membership of the shot. In such an instance, the clustering metric is the likelihood of the data being generated by the model. In other words, BFMM is based on a formulation of maximum likelihood.

As shown in FIG. 10, the probabilistic model used may be a Left-Right Hidden Markov Model (HMM) 1000. Membership in the video sequence is a hidden state variable. For every state, there is a generative model for the shots in that state. This generative model is content based, which means that the probability of generating the data of a shot, such as one of the shots 912-918, is conditioned on the state. In addition, for every state, there is a probability of transitioning to the next state when presented with a new shot.

BFMM may be initiated with every shot having its own model. Merging a pair of adjacent models causes a loss of data likelihood, because one combined model is more general and cannot fit the data as well as two individual models. For example, if L_x is the log likelihood of all of the images assigned to a shot x given an associated model of x, and if shots x and y are being merged to form video sequence z, then the change in likelihood associated with the merging is:

$$\Delta L = L_z - L_x - L_y \qquad (17)$$

At each merging step, the Best-First Model Merging algorithm may selectively merge the two adjacent shots with the most similarity, so that the merge causes the least loss of data likelihood. Additionally, a merging order may be assigned to the boundary between the two merged shots. This merging step may be repeated until all the shots in V are merged into one video sequence.

In this way, if P = {p_i} (i = 2, . . . , N_s) denotes the set of insertion points, a merging order for each shot break p_i (i = 2, . . . , N_s − 1) may be obtained. Moreover, the content discontinuity D_c for each shot break p_i may be further calculated. For example, if the shots s_i and s_{i+1} are merged in the k-th step, then the merging order of p_i is k, and the content discontinuity D_c may be given as D_c(p_i) = k/(N_s + 1).

Exemplary pseudo code for the determination of content discontinuity, which may be carried out by the discontinuity analyzer 208, is given below:

  Input:  S = {s_i}, i = 1, ..., N_s;  P = {p_j}, j = 1, ..., N_s+1
  Output: D = {D_c(p_j)}, j = 1, ..., N_s+1

  1. Initialize
       set D_c(p_{Ns+1}) = 1.00, D_c(p_1) = 0.99

  2. Preprocess
       for σ = 1 to 4 do
         Compute similarity Sim(s_i, s_{i+σ}) for each pair with scale σ
         if Sim(s_i, s_{i+σ}) < T_s do
           Merge {s_k} (k = i+1, ..., i+σ) to s_i and
             Remove {s_k} (k = i+1, ..., i+σ) from S
           {D_c(p_k)} (k = i+1, ..., i+σ) = 0
           N_s = N_s − (σ−1)
         end if
       end for

  3. BFMM
       set merging order N_m = 1
       while N_s > 0 do
         Compute Sim(·) for adjacent shots and get the closest pair (s_i, s_{i+1})
         Merge s_{i+1} to s_i and Remove s_{i+1} from S
         D_c(p_{i+1}) = N_m
         N_m++, N_s−−
       end while

  4. Normalize
       for j = 2 to N_s do
         D_c(p_j) = D_c(p_j) / N_m
       end for

In other embodiments, the overall discontinuity of each shot, in addition to “content discontinuity,” may include “semantic discontinuity.” Semantic discontinuity may be derived using concept detectors that are based on Support Vector Machines (SVMs). Support vector machines (SVMs) are a set of related supervised learning methods used for classification. Each of the concept detectors is configured to detect a particular attribute of a shot. For instance, the discontinuity analyzer 208 may include concept detectors that may be configured to detect various attributes. These attributes may include whether the shot is an indoor shot, whether the shot shows an urban scene, whether the shot depicts a person, whether the shot depicts a water body, etc. In other instances, the discontinuity analyzer 208 may also include concept detectors that are used to detect attributes such as camera motions. These camera motions may include affine motions such as static, pan, tilt, zoom, and rotation, as well as object motions. Nevertheless, it will be appreciated that the discontinuity analyzer 208 may include additional concept detectors that are configured to detect other attributes of the shots, including other types of motions.

In these embodiments, each concept detector may be further configured to output a confidence score for a shot. The confidence score indicates the degree of correlation between the shot and the corresponding attribute detected by the concept detector. Accordingly, for each shot, the plurality of confidence scores may be composed into a multi-dimensional vector. The semantic discontinuity for each shot break, D_s(p_i), may then be calculated as the distance, that is, the difference, between the two multi-dimensional vectors adjacent to an insertion point p_i.

Accordingly, “content discontinuity” and “semantic discontinuity” may be calculated for two types of shot breaks. In one instance, “content discontinuity” and “semantic discontinuity” may be calculated for shot breaks belonging to shots that were parsed using the visual details of the source video (visual shot boundaries). In another instance, “content discontinuity” and “semantic discontinuity” may be calculated for shot breaks belonging to shots that were parsed using the audio stream of the source video (audio shot boundaries).

Moreover, in embodiments where the overall discontinuity includes both “content discontinuity” and “semantic discontinuity,” an overall discontinuity, D(p_i), may be calculated as the average of the “content discontinuity,” D_c(p_i), and the “semantic discontinuity,” D_s(p_i). In additional embodiments, the overall discontinuity, D(p_i), may be calculated as a weighted sum of the “content discontinuity,” D_c(p_i), and the “semantic discontinuity,” D_s(p_i), as shown by the equation:

$$D(p_i) = \lambda D_c(p_i) + (1 - \lambda) D_s(p_i) \qquad (18)$$

where λ is a number between 0 and 1.
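A short sketch of Equation (18) follows, with the semantic discontinuity taken as the Euclidean distance between the two concept-score vectors (an assumed choice for the vector “distance” mentioned above):

    import numpy as np

    def overall_discontinuity(d_content, scores_left, scores_right, lam=0.5):
        # scores_left / scores_right: per-shot vectors of concept-detector
        # confidence scores on either side of the boundary p_i.
        d_semantic = np.linalg.norm(np.asarray(scores_left) - np.asarray(scores_right))
        # Equation (18): weighted sum of content and semantic discontinuity.
        return lam * d_content + (1.0 - lam) * d_semantic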

At block 406, the attractiveness analyzer 210 of the boundary analyzer 206 may determine the attractiveness of each shot in the video source. In one embodiment, the attractiveness of each shot may be based on an estimate of the degree to which the content of the shot will attract viewer attention. In other words, the ability of a particular shot to attract viewer attention may be used as an approximation of attractiveness.

According to various implementations, a shot in the video source may be considered to be a compound of an image sequence, an audio track, and textual information. The image sequences in the shot may present motion (object motion and camera motion), color, texture, shape, and text. The audio channels may consist of speech, music, and various sound effects. Textual information in linguistic form may be obtained from sources such as closed captioning, automatic speech recognition (ASR), and superimposed text.

FIG. 5 shows a process 500 that gauges viewer attention. At block 502, the attractiveness analyzer 210 may analyze a shot to extract content features. For example, the shot may be broken down into visual elements, aural elements, and linguistic elements. At block 504, the attractiveness analyzer 210 may use a set of attention models to generate separate attention curves for the elements.

For instance, attention to visual elements may be modeled by several attention models. These models may include a motion attention model, a static attention model, a semantic attention model (face attention model), and a guided attention model (camera motion model). FIG. 6 further illustrates block 504 of the process 500 by depicting the use of these models.

FIG. 6 shows an exemplary process 600 for using the various attention models. At block 602, the motion attention model may be implemented by the attractiveness analyzer 210. In the motion attention model, motion attention is estimated based on a motion vector field (MVF). An MVF can be obtained by block-based motion estimation. In other words, for a given frame in a shot, the motion field between the frame and the next frame may be extracted and calculated as a set of motion characteristics. In one implementation, when the shot is stored in MPEG format, the MVFs may be extracted from the MPEG data directly.

An MVF may have three inductors: an intensity inductor, a spatial coherence inductor, and a temporal coherence inductor. When the motion vectors in the MVFs pass these inductors, they may be transformed into three kinds of visual maps: an intensity map, a spatial coherence map, and a temporal coherence map, respectively. The normalized outputs from the three inductors are fused into a saliency map. The saliency map may indicate the spatial-temporal distribution of motion attention.

The MVF may include macro blocks, and each macro block may correspond to the three inductors described above. The intensity inductor at each macro block induces motion energy or activity, which may be represented by motion intensity I. Specifically, given an MVF with M*N macro blocks, the motion intensity at each macro block MB_{i,j} (0 ≤ i ≤ M, 0 ≤ j ≤ N) may be computed as the magnitude of the motion vectors:

$$I(i,j) = \sqrt{dx_{i,j}^2 + dy_{i,j}^2} \; / \; MaxMag \qquad (19)$$

where dx_{i,j} and dy_{i,j} denote the two components of the motion vector along the x-axis and y-axis, respectively, and where MaxMag is the normalization factor.

The spatial coherence inductor induces the spatial phase consistency of the motion vectors. The regions with consistent motion vectors are most likely within a moving object. In contrast, the motion vectors with inconsistent phase are often located at object boundaries. Spatial coherency may be measured using an entropy-based method. First, a phase histogram in a spatial window with the size of w*w (pixels) may be computed at each location of a macro block. Then, the coherence of the phase distribution, Cs, may be measured by entropy:

$$Cs(i,j) = -\sum_{t=1}^{n} p_s(t) \log\left( p_s(t) \right) \qquad (20)$$

$$p_s(t) = \frac{SH_{i,j}^{w}(t)}{\sum_{k=1}^{n} SH_{i,j}^{w}(k)} \qquad (21)$$

where SH_{i,j}^{w}(t) is the spatial phase histogram, p_s(t) is the corresponding probability distribution function of the spatial phase, and n is the number of histogram bins.

Similar to the spatial coherence inductor, the temporal coherency, Ct, or the output of the temporal coherence inductor, may be defined in a sliding window with the size of L (frames) as follows:

$$Ct(i,j) = -\sum_{t=1}^{n} p_t(t) \log\left( p_t(t) \right) \qquad (22)$$

$$p_t(t) = \frac{TH_{i,j}^{L}(t)}{\sum_{k=1}^{n} TH_{i,j}^{L}(k)} \qquad (23)$$

where TH_{i,j}^{L}(t) is the temporal phase histogram, p_t(t) is the corresponding probability distribution function of the temporal phase, and n is the number of histogram bins.

In this manner, motion information from the three inductors I, Cs, and Ct may be obtained. The outputs from the three inductors I, Cs, and Ct can be used to characterize the dynamic spatial-temporal attribute of motion. Accordingly, motion attention may be defined as follows:

$$B = I \times Ct \times (1 - I \times Cs) \qquad (24)$$

Moreover, by using this equation, the outputs from the I, Cs, and Ct inductors may be integrated into a motion saliency map, in which areas with attention-attracting motion may be precisely identified.

For instance, additional image processing procedures may be employed to detect the salient motion regions. These additional image processing procedures may include histogram balance, median filtering, binarization, region growing, and region selection. Once motion detection is complete, a motion attention value may be calculated by accumulating the brightness of the detected motion regions in the saliency map as follows:

$$M_{motion} = \frac{\sum_{r \in \Lambda} \sum_{q \in \Omega_r} B_q}{N_{MB}} \qquad (25)$$

where B_q is the brightness of a macro block in the saliency map, Λ is the set of attention-attracting areas caused by motion activities, Ω_r denotes the set of macro blocks in each attention area, and N_{MB} is the number of macro blocks in an MVF, which is used for normalization. The motion attention value of each frame in the shot, M_{motion}, may be used to form a continuous attention curve as a function of time.
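Equations (19) through (25) may be combined into a single sketch per motion vector field. The window size, bin count, scaling of the entropies by log(n_bins) so that Cs and Ct fall in [0, 1], and the averaging over all macro blocks in place of detected motion regions are all illustrative simplifications.

    import numpy as np

    def motion_attention(mvf_dx, mvf_dy, phase_history, w=3, n_bins=8):
        # mvf_dx, mvf_dy: (M, N) motion-vector components per macro block.
        # phase_history: (L, M, N) phases from the last L fields.
        M, N = mvf_dx.shape

        # Equation (19): intensity, normalized by the maximum magnitude.
        mag = np.sqrt(mvf_dx ** 2 + mvf_dy ** 2)
        I = mag / (mag.max() + 1e-12)

        phase = np.arctan2(mvf_dy, mvf_dx)

        def entropy(values):
            # Equations (20)-(23): entropy of a phase histogram.
            hist, _ = np.histogram(values, bins=n_bins, range=(-np.pi, np.pi))
            p = hist / max(hist.sum(), 1)
            p = p[p > 0]
            return -(p * np.log(p)).sum() / np.log(n_bins)

        Cs = np.zeros((M, N))
        Ct = np.zeros((M, N))
        h = w // 2
        for i in range(M):
            for j in range(N):
                Cs[i, j] = entropy(phase[max(0, i - h):i + h + 1,
                                         max(0, j - h):j + h + 1])
                Ct[i, j] = entropy(phase_history[:, i, j])

        # Equation (24): fuse the three inductor outputs per macro block.
        B = I * Ct * (1.0 - I * Cs)

        # Equation (25), simplified: accumulate over all macro blocks.
        return B.sum() / (M * N)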

At block 604, the attractiveness analyzer 210 may implement the static attention model. The static attention model measures the ability of static regions of images to attract viewer attention. According to various embodiments, the static attention model may be based on a contrast-based saliency map, because contrast is an important parameter in assessing vision. Whether an object can be perceived or not depends on the distinctiveness between itself and its environment. Consequently, the contrast-based saliency map may be based on a generic contrast definition. For example, given an image with M*N pixels or blocks, the contrast value at a pixel or block is defined as follows:

$$C_{i,j} = \sum_{q \in \Theta} d\left( p_{i,j}, q \right) \qquad (26)$$

where p_{i,j} (i ∈ [0,M], j ∈ [0,N]) and q are representations of the appearance (such as color) at a pixel/block, Θ is the neighborhood of p_{i,j} whose size is able to control the sensitivity of the contrast measure, and d is the difference between p_{i,j} and q, which may be any distance measurement suited to different requirements. If all contrasts C_{i,j} are normalized, a saliency map may be formed. Such a contrast-based saliency map may represent color, texture, and approximate shape information simultaneously.

Once a saliency map is developed, it may be further processed to show attention-attracting areas as bright areas on a gray-level map. The size, position, and brightness of the attention-attracting areas in the gray saliency map indicate the degree of viewer attention they attract. Accordingly, a static attention model is computed based on the number of attention-attracting areas as well as the brightness, area, and position of the attention-attracting areas:

$$M_{static} = \frac{1}{A_{frame}} \sum_{k=1}^{N} \sum_{(i,j) \in R_k} B_{i,j} \cdot w_{pos}^{i,j} \qquad (27)$$

where B_{i,j} denotes the brightness of the pixels in the attention-attracting area R_k, N denotes the number of attention-attracting areas, A_{frame} is the area of the frame, and w_{pos}^{i,j} is a normalized Gaussian template with its center located at the center of the frame. Since viewers usually pay more attention to areas near the frame center, this normalized Gaussian template assigns a weight to each pixel or block in this region. The static attention value of each frame in the shot, M_{static}, may then be used to form a continuous attention curve as a function of time.
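A compact sketch of Equations (26) and (27) follows; using the full saliency map instead of segmented attention areas, and the particular Gaussian spread, are illustrative simplifications.

    import numpy as np

    def static_attention(frame, w=1):
        # frame: 2-D grayscale array of pixels or block values.
        M, N = frame.shape
        f = frame.astype(np.float64)

        # Equation (26): contrast as summed difference to the neighborhood.
        C = np.zeros_like(f)
        for di in range(-w, w + 1):
            for dj in range(-w, w + 1):
                if di == 0 and dj == 0:
                    continue
                shifted = np.roll(np.roll(f, di, axis=0), dj, axis=1)
                C += np.abs(f - shifted)
        C /= C.max() + 1e-12               # normalized saliency map

        # Normalized Gaussian template centered on the frame.
        ii, jj = np.mgrid[0:M, 0:N]
        sigma = 0.3 * max(M, N)
        w_pos = np.exp(-((ii - M / 2) ** 2 + (jj - N / 2) ** 2) / (2 * sigma ** 2))
        w_pos /= w_pos.max()

        # Equation (27): position-weighted brightness over the frame area.
        return (C * w_pos).sum() / (M * N)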

At block 606, the attractiveness analyzer 210 may implement the face attention model. The face attention model may be employed to assess the attention-attracting ability of faces in a shot. The appearances of dominant faces in a video usually attract viewer attention. In other words, the position and size of a face in a shot reflect the importance of the face. Correspondingly, the position and size of a face also reflect the importance of the frame containing the face. Thus, face attention may be modeled as:

$\begin{matrix}{M_{face} = {\sum\limits_{k = 1}^{N}{\frac{A_{k}}{A_{frame}} \times \frac{w_{pos}^{i}}{8}}}} & (28)\end{matrix}$

where $A_k$ denotes the size of the k-th face, $A_{frame}$ denotes the area of the frame, $w_{pos}^{i}$ is the weight of a position as shown in FIG. 6, and i∈[0,8] is the index of the position.
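Equation (28) may be sketched as follows; the nine-entry `position_weights` table standing in for the position weights of FIG. 6 is hypothetical.

```python
def face_attention(faces, frame_area, position_weights):
    """Sketch of equation (28): each face contributes its relative
    size A_k / A_frame scaled by the weight w_pos_i / 8 of the grid
    position it occupies, with i in [0, 8].

    faces -- list of (face_area, position_index) pairs
    """
    return sum((area / float(frame_area)) * (position_weights[i] / 8.0)
               for area, i in faces)
```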

At block 608, the attractiveness analyzer 210 may implement the camera attention model. The camera attention model may be configured to transform camera motion variations, including motion type, direction, and velocity, into an attention curve. For example, camera motion may be classified into the following types: (1) panning and tilting, or camera rotation around the x-axis and y-axis; (2) rolling, resulting from camera rotation around the z-axis; (3) tracking and booming, resulting from camera displacement along the x-axis and the y-axis; (4) dollying, resulting from camera displacement along the z-axis; (5) zooming, resulting from focus adjustment; and (6) still.

According to various embodiments, the attention values caused by the above-described camera motions are quantified to the range of [0˜2]. An attention value higher than “1” means emphasis, and an attention value smaller than “1” indicates neglect. An attention value that is equal to “1” indicates that the camera did not intend to attract the viewer's attention.

In some embodiments, camera attention may be modeled based on the following assumptions: (1) zooming and dollying are always used for emphasis; specifically, the faster the zooming/dollying speed, the more important the focused content; (2) horizontal panning indicates neglect; (3) other camera motions have no obvious intention; and (4) if the camera motion changes too frequently, the motion is considered random or unstable. Accordingly, camera motions other than zooming, dollying, and horizontal panning, as well as rapid camera motion changes, may be assigned a value of “1”.
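These assumptions may be illustrated by the following non-limiting sketch, in which the scaling constants are illustrative rather than prescribed by the model:

```python
def camera_attention(motion_type, speed=0.0, unstable=False):
    """Sketch of the camera attention assumptions: values lie in
    [0, 2], where 1 is neutral, above 1 emphasizes, below 1 neglects."""
    if unstable:                            # random/unstable motion
        return 1.0
    if motion_type in ("zoom", "dolly"):    # always used for emphasis
        return min(2.0, 1.0 + speed)        # faster -> more important
    if motion_type == "pan_horizontal":     # indicates neglect
        return max(0.0, 1.0 / (1.0 + speed))
    return 1.0                              # no obvious intention
```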

FIG. 11 illustrates exemplary models of camera motion. Specifically, graph 1102 shows zooming: the attention degree is assigned to “1” when zooming starts, and the attention degree at the end of the zooming is directly proportional to the zooming speed $V_z$. As shown in graph 1104, if a camera becomes still after a zooming, the attention degree at the end of the zooming will continue for a certain period of time $t_k$ and then return to “1”.

Additionally, the attention degree of panning is determined by two aspects: the speed $V_p$ and the direction $\gamma$. Thus, the attention degree may be modeled as the product of the inverse of the speed and a quantization function of the direction, as shown in graph 1106. Further, as shown in graph 1108, the motion direction γ∈[0˜π/2] may be mapped to [0˜2] by a subsection function. Specifically, 0 is assigned to the direction γ=π/4, 1 is assigned to γ=0, and 2 is assigned to the direction γ=π/2. Thus, the first section is monotonically decreasing while the second section is monotonically increasing.
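The subsection function of graph 1108 and the panning attention of graph 1106 may be sketched as follows, assuming linear interpolation within each section:

```python
import math

def pan_direction_weight(gamma):
    """Sketch of graph 1108: map gamma in [0, pi/2] to [0, 2], with
    1 at gamma = 0, 0 at gamma = pi/4, and 2 at gamma = pi/2."""
    quarter = math.pi / 4.0
    if gamma <= quarter:
        return 1.0 - gamma / quarter              # decreasing section
    return 2.0 * (gamma - quarter) / quarter      # increasing section

def pan_attention(speed, gamma):
    """Sketch of graph 1106: the product of the inverse panning speed
    and the direction quantization, clipped to the [0, 2] range."""
    return min(2.0, pan_direction_weight(gamma) / max(speed, 1e-6))
```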

Graph 1110 shows panning followed by still. In such a scenario, the attention degree may continue for a certain period of time $t_k$ after the panning has stopped. Moreover, the attention degree will be inversely proportional to the panning speed $V_p$. Other models of the relationship between camera motion and attention degree are further illustrated in FIG. 11. For example, graph 1112 shows still and other types of camera motion, graph 1114 shows zooming followed by panning, graph 1116 shows panning followed by zooming, and graph 1118 shows still followed by zooming.

Audio attention is an important part of the overall attention model framework 400. For example, speech and music are semantically meaningful for human beings, and loud or sudden noises also tend to grab attention. Accordingly, the definition of audio attention may include three audio attention models: an audio saliency attention model, a speech attention model, and a music attention model. According to various embodiments, these models may be implemented by the attractiveness analyzer 210. FIG. 7 further illustrates block 406 of the process 400 by depicting the use of these audio attention models.

FIG. 7 illustrates an exemplary process 700 for using the various attention models. At block 702, the attractiveness analyzer 210 may implement the audio saliency attention model. The audio saliency attention model may correlate audience attention with the amount of sound energy, or loudness, of the sound. For example, viewers are often attracted to loud or sudden sounds. In various implementations, it is assumed that an individual may pay attention to a sound if: (1) an absolutely loud sound, as measured by the average energy of the sound, occurs; or (2) a sudden increase or decrease of a sound, as measured by the energy peak of the sound, occurs. Consequently, the audio saliency model is computed by:

$\begin{matrix}{M_{as} = {\overset{\_}{E}}_{a} \cdot {\overset{\_}{E}}_{p}} & (29)\end{matrix}$

where $\overset{\_}{E}_a$ and $\overset{\_}{E}_p$ are the two components of audio saliency: the normalized average energy and the normalized energy peak in an audio segment. They are calculated as follows, respectively:

$\begin{matrix}{{\overset{\_}{E}}_{a} = {{E_{avr}/{Max}}\; E_{avr}}} & (30) \\{{\overset{\_}{E}}_{p} = {{E_{peak}/{Max}}\; E_{peak}}} & (31)\end{matrix}$

where $E_{avr}$ and $E_{peak}$ denote the average energy and the energy peak of an audio segment, respectively, and $Max E_{avr}$ and $Max E_{peak}$ are the maximum average energy and the maximum energy peak of an entire audio segment corpus. In particular embodiments, a sliding window may be used to compute audio saliency along an audio segment.
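Equations (29) through (31), computed over a sliding window as noted above, may be sketched as follows; the window length is an illustrative parameter.

```python
import numpy as np

def audio_saliency(energy, window=32):
    """Sketch of equations (29)-(31): for each sliding-window position,
    M_as is the product of the normalized average energy E_a and the
    normalized energy peak E_p."""
    energy = np.asarray(energy, dtype=float)
    positions = len(energy) - window + 1
    avg = np.array([energy[t:t + window].mean() for t in range(positions)])
    peak = np.array([energy[t:t + window].max() for t in range(positions)])
    e_a = avg / max(avg.max(), 1e-12)    # E_a: normalized average energy
    e_p = peak / max(peak.max(), 1e-12)  # E_p: normalized energy peak
    return e_a * e_p                     # M_as along the audio segment
```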

At block 704, the attractiveness analyzer 210 may implement the speech and music attention models. The speech and music attention models may be used to correlate audience attention with speech and music. In general, an audience usually pays more attention to shots that are accompanied by speech or music. Accordingly, the saliency of speech/music is measured by the ratio of speech/music to other sounds. In one implementation, an audio segment accompanying a shot may be divided into sub-segments. Features may then be extracted from the sub-segments. The features may include mel-frequency cepstral coefficients (MFCCs) and a number of perceptual features, such as short time energy (STE), zero crossing rates (ZCR), sub-band power distribution, brightness, bandwidth, spectrum flux (SF), and band periodicity (BP). A support vector machine (SVM) may be employed to classify each audio sub-segment into speech, music, silence, and other audio components. In this manner, the speech ratio and music ratio may be computed as:

$\begin{matrix}{M_{speech} = {N_{speech}^{w}/N_{total}^{w}}} & (32) \\{M_{music} = {N_{music}^{w}/N_{total}^{w}}} & (33)\end{matrix}$

where $N_{speech}^{w}$, $N_{music}^{w}$, and $N_{total}^{w}$ are the number of speech sub-segments, music sub-segments, and total sub-segments in a sliding window w, respectively. Once block 704 is completed, the process 700 may return to block 508 of the process 500.
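Equations (32) and (33) may be illustrated by the following sketch, which assumes the sub-segment labels were already produced by a classifier such as the SVM described above:

```python
def speech_music_attention(labels, window=20):
    """Sketch of equations (32)-(33): the speech/music attention in a
    sliding window w is the count of matching sub-segments divided by
    the total number of sub-segments in the window."""
    positions = len(labels) - window + 1
    speech, music = [], []
    for t in range(positions):
        win = labels[t:t + window]
        speech.append(win.count("speech") / float(window))  # eq. (32)
        music.append(win.count("music") / float(window))    # eq. (33)
    return speech, music
```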

At block 508, the attractiveness analyzer 210 may fuse the various visual and aural models into a final attention curve. The final attention curve may represent the attractiveness of a shot. According to various embodiments, the fusion of curves from the various attention models may be carried out as a linear combination or a nonlinear combination.

In the linear fusion scheme, the curves from the various models may first be normalized. Subsequently, linear fusion may be implemented according to the following:

$\begin{matrix}{A = {w_{v} \cdot M_{v}} + {w_{a} \cdot M_{a}} + {w_{l} \cdot M_{l}}} & (34)\end{matrix}$

where $w_v, w_a, w_l \geq 0$ ($w_v + w_a + w_l = 1$) are the weights for the linear combination, and $M_v$, $M_a$, and $M_l$ are the normalized visual, aural, and linguistic attention models, respectively. $M_v$, $M_a$, and $M_l$ may be computed as:

$\begin{matrix}{M_{v} = {\left( {\sum\limits_{i = 1}^{p}{w_{i} \cdot {\overset{\_}{M}}_{i}}} \right) \times \left( {\overset{\_}{M}}_{cm} \right)^{S_{cm}}}} & (35) \\{M_{a} = {\left( {\sum\limits_{j = 1}^{q}{w_{j} \cdot {\overset{\_}{M}}_{j}}} \right) \times \left( {\overset{\_}{M}}_{as} \right)^{S_{as}}}} & (36) \\{M_{l} = \left( {\sum\limits_{k = 1}^{r}{w_{k} \cdot {\overset{\_}{M}}_{k}}} \right)} & (37)\end{matrix}$

where $w_i$, $w_j$, and $w_k$ are internal linear combination weights in the visual, aural, and linguistic attention models, respectively, with the constraints of $w_i \geq 0$, $w_j \geq 0$, $w_k \geq 0$, and

${{\sum\limits_{i = 1}^{p}w_{i}} = 1},\;{{\sum\limits_{j = 1}^{q}w_{j}} = 1},\;{{\sum\limits_{k = 1}^{r}w_{k}} = 1.}$

Further, $\overset{\_}{M}_i$, $\overset{\_}{M}_j$, and $\overset{\_}{M}_k$ denote the normalized component models of the visual, aural, and linguistic attention models, respectively. $\overset{\_}{M}_{cm}$ is the normalized camera motion model, which is used as a magnifier in the visual attention model, and $S_{cm}$ works as the switch for this magnifier: if $S_{cm} \geq 1$, the magnifier is turned on; if $S_{cm} < 1$, the magnifier is turned off. The higher the value of $S_{cm}$, the more evident the effects of the magnifier. Similarly, $\overset{\_}{M}_{as}$ is the normalized aural saliency model, used as a magnifier for the aural attention models, with $S_{as}$ acting as its switch. As magnifier switches, $S_{as}$ and $S_{cm}$ are both normalized to [0˜2]. In this manner, users may adjust the weights in equations (34)-(37) according to their preferences or application scenarios.
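The linear fusion of equations (34) through (37) may be sketched as follows for a single frame, assuming the internal sums over the component models have already been folded into `m_v`, `m_a`, and `m_l`; the weights shown are illustrative only.

```python
def linear_fusion(m_v, m_a, m_l, m_cm, m_as,
                  w=(0.5, 0.3, 0.2), s_cm=1.0, s_as=1.0):
    """Sketch of equations (34)-(37): the camera-motion and aural
    saliency models magnify the visual and aural models when their
    switches s_cm and s_as are at least 1; otherwise they are off."""
    w_v, w_a, w_l = w
    assert abs(w_v + w_a + w_l - 1.0) < 1e-9   # weights must sum to 1
    visual = m_v * (m_cm ** s_cm) if s_cm >= 1 else m_v   # eq. (35)
    aural = m_a * (m_as ** s_as) if s_as >= 1 else m_a    # eq. (36)
    return w_v * visual + w_a * aural + w_l * m_l         # eq. (34)
```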

The nonlinear fusion scheme may be especially suitable for situations where one or more attention components have relatively high attention values but the other attention components have very low values; for example, a shot with high motion attention but very low aural attention.

Further, the nonlinear fusion scheme should be a monotonically increasing function. Specifically, if n is the number of features in a shot, the feature vector is denoted by $\vec{x} = (x_{1}, x_{2}, \ldots, x_{n})$, where $0 \leq x_{i} \leq 1$ and $1 \leq i \leq n$, and the fusion function is denoted by $f(\vec{x})$ or $f(x_{1}, x_{2}, \ldots, x_{n})$, then the nonlinear fusion scheme should satisfy the two criteria (38) and (39) for the case of n=2:

$\begin{matrix}{f(x_{1},x_{2}) < f(x_{1} + \varepsilon,\; x_{2} - \varepsilon)} & (38)\end{matrix}$

where $0 < \varepsilon \leq x_{2} \leq x_{1}$, and

$\begin{matrix}{f(x_{1},x_{2}) < f(x_{1} + \varepsilon,\; x_{2})} & (39)\end{matrix}$

where $0 < \varepsilon$.

Accordingly, a multi-dimensional Attention Fusion Function (AFF) may be obtained using:

$\begin{matrix}{{AFF_{n}^{(\gamma)}\left( \vec{x} \right)} = {{E\left( \vec{x} \right)} + {\frac{1}{{2\left( {n - 1} \right)} + {n\gamma}}{\sum\limits_{k = 1}^{n}\left| {x_{k} - {E\left( \vec{x} \right)}} \right|}}}} & (40)\end{matrix}$

where $\gamma > 0$ is a constant and $E(\vec{x})$ is the mean of the feature vector $\vec{x}$, and wherein the following inequalities (41) and (42) are satisfied:

$\begin{matrix}{AFF_{n}^{(\gamma)}(x_{1}, \ldots, x_{i}, \ldots, x_{n}) < AFF_{n}^{(\gamma)}(x_{1}, \ldots, x_{i} + \varepsilon, \ldots, x_{n})} & (41)\end{matrix}$

where $1 \leq i \leq n$ and $\varepsilon > 0$, and

$\begin{matrix}{AFF_{n}^{(\gamma)}(x_{1}, \ldots, x_{i}, \ldots, x_{j}, \ldots, x_{n}) \leq AFF_{n}^{(\gamma)}(x_{1}, \ldots, x_{i} + \varepsilon, \ldots, x_{j} - \varepsilon, \ldots, x_{n})} & (42)\end{matrix}$

where $1 \leq i < j \leq n$ and $x_{i} \geq x_{j} \geq \varepsilon > 0$.
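Equation (40) may be sketched as follows; the sketch assumes the summand is the absolute deviation |x_k − E(x⃗)|, since without the absolute value the sum would vanish, E(x⃗) being the mean.

```python
import numpy as np

def attention_fusion(x, gamma=1.0):
    """Sketch of equation (40): the mean of the feature vector plus a
    scaled sum of absolute deviations, so that concentrating weight in
    already-large components increases the fused attention value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    return mean + np.abs(x - mean).sum() / (2.0 * (n - 1) + n * gamma)
```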

At block 510, once an attention curve has been obtained using one of the linear and nonlinear fusion schemes, the attractiveness analyzer 210 may extract the key frames and video segments from around the crests of the curve. According to the definition of the user attention model, the crests on the attention curve generally indicate the corresponding key frame or segment that is most likely to attract viewer attention. Moreover, a derivative curve may be generated from the attention curve to determine the precise positions of the crests. The zero-crossing points from positive to negative on the derivative curve are the locations of the wave crests. Based on the wave crests, a multi-scale static abstraction may be generated to rank the key frames according to the attention value of each frame.
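The crest localization at block 510 may be sketched as follows, using a discrete first derivative and its positive-to-negative zero crossings:

```python
import numpy as np

def find_crests(attention_curve):
    """Sketch of block 510: crests are where the derivative of the
    attention curve crosses zero from positive to negative; the frames
    at those positions are candidate key frames."""
    a = np.asarray(attention_curve, dtype=float)
    d = np.diff(a)                          # discrete derivative
    return [t for t in range(1, len(d))
            if d[t - 1] > 0 and d[t] <= 0]  # + to - crossings
```

Key frames may then be ranked by the attention value of the curve at each returned index.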

In some implementations, shots in a video source may be further ranked according to the key frames. For example, key frames between two shot boundaries may be used as representative frames of a shot. In such an example, the combined values of the key frames may be used to rank the shot. Additionally, in cases where there is no crest in a shot, the middle frame is chosen as the key frame, and the importance value of this shot is set to zero.

In other instances, each shot in a video source may include only one key frame. In such instances, the ranking of each shot may be based on the attention value of the key frame. Moreover, if the total number of key frames present, as indicated by the crests, is more than the number of shots in a video, shots with lower importance values will be ignored. Once the attention value for each shot in the video source has been obtained, the process 500 may return to block 406.

At block 406, the attractiveness analyzer 210 may ascertain the attractiveness $A(p_i)$ by computing the linear combination of the attractiveness degrees of the two shots adjacent to a shot boundary $p_i$ as follows:

$\begin{matrix}{A(p_{i}) = {\lambda \times A(s_{i})} + {\left( {1 - \lambda} \right) \times A(s_{i + 1})}} & (43)\end{matrix}$

where λ is between 0 and 1.
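Equation (43) reduces to a one-line combination; a minimal sketch, with λ as a tunable parameter between 0 and 1:

```python
def boundary_attractiveness(a_left, a_right, lam=0.5):
    """Sketch of equation (43): attractiveness of a shot boundary as a
    linear combination of the attractiveness of the two adjacent
    shots, A(s_i) and A(s_i+1)."""
    return lam * a_left + (1.0 - lam) * a_right
```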

At block 408, the insertion point generator 212 may determine the video advertisement insertion points based on the discontinuity and attractiveness of each shot boundary $p_i$ (i=2, . . . , N). In embodiments where the shot boundaries include both visual shot boundaries and audio shot boundaries, the insertion point generator 212 may first linearly combine the discontinuities of the visual and audio shot boundaries. In other embodiments, the insertion point generator 212 may directly detect the video advertisement insertion points. The detection of video advertisement insertion points can be formalized as finding peaks on a curve that is the linear combination of discontinuity and attractiveness, i.e., αA+βD, where −1<α, β<1. Specifically, different advertising strategies may lead to the selection of different α and β. For example, α may be set greater than β, or α>β, to provide more benefits to advertisers than viewers. On the other hand, α may be set less than β, or α<β, to provide more benefits to viewers than advertisers.
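The peak finding on the combined curve αA+βD may be sketched as follows; the values of α and β are illustrative and would be chosen per the advertising strategy described above.

```python
import numpy as np

def insertion_points(attractiveness, discontinuity, alpha=0.6, beta=0.4):
    """Sketch of block 408: combine the per-boundary attractiveness A
    and discontinuity D into alpha*A + beta*D and return the indices
    of the local peaks as candidate insertion points."""
    curve = (alpha * np.asarray(attractiveness, dtype=float) +
             beta * np.asarray(discontinuity, dtype=float))
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]]
```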

In alternative embodiments, the discontinuity/attractiveness evaluator 214 of the insertion point generator 212 may be employed to set video advertisement insertion points for the investigation of viewer tolerance to the inserted video advertisements, as well as the effectiveness of the video advertisements from the perspective of advertisers. For example, a set of different parameters (i.e., α and β) may be selected for generating a combined curve. Video advertisements are then inserted into a video source based on the combined curves. Viewers may then view the video source and provide feedback as to their reaction to the advertisement-integrated source video (e.g., enjoyable, okay, annoyed, etc.). Moreover, advertisers may be asked to assess the effectiveness of the inserted video advertisements. For example, if the selected insertion points for the video advertisements are effective, the advertisers may observe an increase in website traffic or telephone calls originating from the inserted video advertisements. At block 410, the advertisement embedder 216 may insert at least one video advertisement at each of the detected insertion points.

The determination of optimal video advertisement insertion points based on the concepts of “attractiveness” and “intrusiveness” may enable advertisers to maximize the impact of their video advertisements. Specifically, optimal video advertisement insertion points may focus the attention of viewers on the advertisements while keeping negative reactions from most viewers, such as irritation or annoyance, to a minimum. In this manner, content providers, advertisers, and the general public may all benefit. For example, providers of video sources, or content providers, may upload creative content. Meanwhile, advertisers may reach an audience by seamlessly integrating video advertisements into the uploaded content. In turn, the content providers may receive compensation for their creative efforts through a share of the advertising revenue while distributing their content to viewers at no cost.

Exemplary System Architecture

FIG. 12 illustrates a representative computing environment 1200 that may be used to implement the insertion point determination techniques and mechanisms described herein. However, it will be readily appreciated that the various embodiments of these techniques may be implemented in different computing environments. The computing environment 1200 shown in FIG. 12 is only one example of a computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the computer and network architectures. Neither should the computing environment 1200 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the example computing environment.

As depicted in FIG. 12, the exemplary computing environment 1200 may include a computing device 1202 having one or more processors 1206. A system memory 1208 is coupled to the processor(s) 1206 by one or more buses 1210. The one or more buses 1210 may be implemented using any kind of bus structure or combination of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. It is appreciated that the one or more buses 1210 provide for the transmission of computer-readable instructions, data structures, program modules, and other data encoded in one or more modulated carrier waves. Accordingly, the one or more buses 1210 may also be characterized as computer-readable mediums.

The system memory 1208 may include both volatile and non-volatile memory, such as random access memory (RAM) 1212 and read only memory (ROM) 1214. The environment 1200 also includes one or more mass storage devices, which may also be characterized as mass storage type input/output devices and may include a variety of types of volatile and non-volatile media, each of which can be removable or non-removable. For example, the mass storage devices may include a hard disk drive 1218 for reading from and writing to non-removable, non-volatile magnetic media, a magnetic disk drive 1220 for reading from and writing to a removable, non-volatile magnetic disk 1222 (e.g., a “floppy disk”), and an optical disk drive 1224 for reading from and/or writing to a removable, non-volatile optical disk 1226 such as a compact disk (CD), digital versatile disk (DVD), or other optical media. Although not shown, the one or more mass storage devices may also include other types of computer-readable media, such as magnetic cassettes or other magnetic storage devices, flash memory cards, electrically erasable programmable read-only memory (EEPROM), or the like. The hard disk drive 1218, magnetic disk drive 1220, and optical disk drive 1224 may each be connected to the system bus 1210 by one or more data media interfaces 1228. Alternatively, the hard disk drive 1218, magnetic disk drive 1220, and optical disk drive 1224 may be coupled to the system bus 1210 by a SCSI interface (not shown), or other coupling mechanism.

In addition to the mass storage type input/output devices described above, the environment 1200 includes various input/output devices such as a display device 1204, a keyboard 1238, a pointing device 1240 (e.g., a “mouse”), and one or more communication ports 1250. In further embodiments, the input/output devices may also include speakers, a microphone, a printer, a joystick, a game pad, a satellite dish, a scanner, card reading devices, a digital or video camera, or the like. The input/output devices may be coupled to the system bus 1210 through any kind of input/output interface 1242 and bus structures, such as a parallel port, serial port, game port, universal serial bus (USB) port, video adapter 1244, or the like.

The computing environment 1200 may further include one or more additional computing devices 1246 communicatively coupled by one or more networks 1248. Accordingly, the computing device 1202 may operate in a networked environment using logical connections to one or more remote computing devices 1246. The remote computing device 1246 can comprise any kind of computer equipment, including personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, game consoles, programmable consumer electronics, network PCs, minicomputers, and mainframe computers. The remote computing devices 1246 may include all of the features discussed above with respect to computing device 1202, or some subset thereof. The networked environment may further be utilized to implement a distributed computing environment. In a distributed computing environment, computing resources can be physically dispersed throughout the environment.

Any type of network 1248 can be used to couple the computing device 1202 with one or more remote computing devices 1246, such as a wide-area network (WAN), a local area network (LAN), and/or the like. The computing device 1202 may be coupled to the network 1248 via a communication port 1250, such as a network interface card. The communication port 1250 may utilize broadband connectivity, modem connectivity, DSL connectivity, or another connection strategy. Although not illustrated, the computing environment 1200 may also provide wireless communication functionality for connecting the computing device 1202 with remote computing devices 1246 (e.g., via modulated radio signals, modulated infrared signals, etc.). It is appreciated that the one or more networks 1248 provide for the transmission of computer-readable instructions, data structures, program modules, and other data encoded in one or more modulated carrier waves.

Generally, one or more of the above-identified computer-readable mediums provide storage of computer-readable instructions, data structures, program modules, and other data for use by the computing device 1202. For instance, one or more of the computer-readable mediums may store the operating system 1230, one or more application functionalities 1232 (including functionality for implementing aspects of the insertion point determination methods), other program modules 1234, and program data 1236. More specifically, the ROM 1214 typically includes a basic input/output system (BIOS) 1216. BIOS 1216 contains the basic routines that help to transfer information between elements within the computing device 1202, such as during start-up. The RAM 1212 typically contains the operating system 1230′, one or more application functionalities 1232′, other program modules 1234′, and program data 1236′, in a form that can be quickly accessed by the processor 1206. The content in the RAM 1212 is typically transferred to and from one or more of the mass storage devices (e.g., hard disk drive 1218) for non-volatile storage thereof.

It is appreciated that the illustrated operating environment 1200 is only one example of a suitable operating environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Other well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, game consoles, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and/or the like.

CONCLUSION

In closing, although the various embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter.

The invention claimed is:
1. A method, comprising: parsing, at a computing device, a first video into a plurality of shots that includes one or more shot boundaries; and determining, at the computing device, one or more insertion points for inserting a second video into the first video based on a discontinuity and an attractiveness of each of the one or more shot boundaries, the discontinuity of a shot boundary being a measure of dissimilarity between a pair of shots that are adjacent to the shot boundary, and the attractiveness of the shot boundary being an amount of viewer attention that a corresponding shot boundary attracts that is estimated based on applying one or more attention models to the corresponding shot boundary and combining results from the one or more attention models in accordance with at least one of a linear weighted relationship characterizing the one or more attention models or a non-linear increasing relationship characterizing the one or more attention models.
2. The method of claim 1, wherein the determining the one or more insertion points includes: computing a degree of discontinuity for each of the one or more shot boundaries; computing a degree of attractiveness for each of the one or more shot boundaries; determining one or more insertion points based on the degree of discontinuity and the degree of attractiveness of each shot boundary; and inserting the second video at the one or more determined insertion points of the first video to form an integrated video stream.
3. The method of claim 2, further comprising providing the integrated video stream for playback, and assessing effectiveness of the one or more insertion points based on viewer feedback to a played integrated video stream.
4. The method of claim 2, wherein the determining the one or more insertion points includes finding peaks in a linear combination of degrees of discontinuity and degrees of attractiveness of a plurality of shot boundaries.
5. The method of claim 2, wherein each shot boundary comprises a visual shot boundary or an audio shot boundary, and wherein the determining the one or more insertion points further includes linearly combining degrees of discontinuity of one or more visual shot boundaries and one or more audio shot boundaries.
6. The method of claim 2, wherein computing the degree of discontinuity for each of the one or more shot boundaries includes computing at least one of a degree of content discontinuity or a degree of semantic discontinuity for each shot boundary.
7. The method of claim 6, wherein the computing the degree of content discontinuity for each shot boundary includes: using a merge method to merge one or more pairs of adjacent shots, wherein each pair of adjacent shots includes a shot boundary; assigning a merging order to each shot boundary, the merging order being assigned based on a chronological order in which a corresponding pair of adjacent shots are merged; and calculating a content discontinuity value for each shot boundary based on the corresponding merging order.
8. The method of claim 6, wherein the computing the degree of the semantic discontinuity for each of the one or more shot boundaries includes: using one or more first concept detectors to determine a first confidence score between a first shot and at least one corresponding first attribute; using one or more second concept detectors to determine a second confidence score between a second shot and at least one corresponding second attribute; composing at least the first confidence score into a first vector; composing at least the second confidence score into a second vector; and calculating a semantic discontinuity value for a shot boundary between the first and second shots based on a difference between the first vector and the second vector.
9. The method of claim 8, wherein the using the one or more first concept detectors and the one or more second concept detectors includes using concept detectors that are based on support vector machines (SVMs).
10. The method of claim 2, wherein the computing the degree of attractiveness includes computing the degree of attractiveness for each of the plurality of shots using at least one of a motion attention model, a static attention model, a semantic attention model, or a guided attention model.
11. The method of claim 10, wherein the computing the degree of attractiveness further includes computing the degree of attractiveness for each of the plurality of shots using at least one of an audio saliency model, a speech attention model, or a music attention model.
12. The method of claim 11, wherein the computing the degree of attractiveness for each of the one or more shot boundaries further includes: obtaining one or more visual attractiveness values for at least one shot from a corresponding group of visual attention models; obtaining one or more audio attractiveness values for the at least one shot from the corresponding group of audio attention models; and combining the one or more visual attractiveness values and the one or more audio attractiveness values using one of the linear weighted relationship or the non-linear increasing relationship to obtain the degree of attractiveness for the at least one shot.
13. The method of claim 12, wherein the computing the degree of attractiveness for each of the one or more shot boundaries further includes: acquiring a first degree of attractiveness for a first shot; acquiring a second degree of attractiveness for a second shot that is adjacent the first shot; computing a degree of overall attractiveness, A(p_i), by computing a linear combination of attractiveness degrees from the first and second shots according to: A(p_i)=λ×A(s_i)+(1−λ)×A(s_(i+1)), wherein A(s_i) represents the degree of attractiveness for the first shot, A(s_(i+1)) represents the degree of attractiveness for the second shot, and λ is a number between 0 and 1.
14. The method of claim 2, wherein the determining the one or more insertion points includes: constructing a linear combination curve of one or more degrees of overall discontinuity and one or more degrees of attractiveness; and determining the one or more insertion points based on one or more peaks on the linear combination curve.
15. The method of claim 1, wherein the parsing the first video includes parsing the first video into the plurality of shots based on one of visual details of the first video or an audio stream of the first video.
16. The method of claim 15, wherein the parsing the first video into the plurality of shots based on the visual details of the first video includes using one of a pair-wise comparison method, a likelihood ratio method, an intensity level histogram method, or a twin-comparison method, and wherein the parsing the first video into the plurality of shots based on the audio stream of the first video includes using a plurality of kernel support vector machines.
17. A memory having computer-executable instructions that are executable to perform acts comprising: parsing a first video into a plurality of shots, the plurality of shots includes one or more shot boundaries; computing a degree of overall discontinuity for each of the one or more shot boundaries, each degree of discontinuity being a measure of dissimilarity between a pair of shots that are adjacent to a corresponding shot boundary; computing a degree of attractiveness for each of the one or more shot boundaries, each degree of attractiveness being an amount of viewer attention that a corresponding shot boundary attracts that is estimated based on applying one or more attention models to the corresponding shot boundary and combining results from the one or more attention models in accordance with at least one of a linear weighted relationship characterizing the one or more attention models or a non-linear increasing relationship characterizing the one or more attention models; determining one or more insertion points based on the degree of overall discontinuity and the degree of attractiveness of each shot boundary, the one or more insertion points being for inserting a second video into the first video; and inserting the second video at the one or more determined insertion points to form an integrated video stream.
18. The memory of claim 17, further comprising providing the integrated video stream for playback, and assessing effectiveness of the one or more insertion points based on viewer feedback to a played integrated video stream.
19. The memory of claim 17, wherein the parsing the first video includes parsing the first video into the plurality of shots using one of a pair-wise comparison method, a likelihood ratio method, an intensity level histogram method, or a twin-comparison method.
20. The memory of claim 17, wherein the computing the degree of discontinuity for each shot boundary includes: computing a degree of content discontinuity for a shot boundary; computing a degree of semantic discontinuity for the shot boundary; and computing a degree of overall discontinuity for the shot boundary based on an average of the degree of content discontinuity and the degree of semantic discontinuity.
21. The memory of claim 17, wherein computing the degree of discontinuity for each shot boundary includes: computing a degree of content discontinuity for a shot boundary; computing a degree of semantic discontinuity for the shot boundary; and computing the degree of overall discontinuity, D(p_i), by combining the degree of content discontinuity and the degree of semantic discontinuity for the shot boundary according to: D(p_i)=λD_c(p_i)+(1−λ)D_s(p_i), wherein D_c(p_i) represents the degree of content discontinuity, D_s(p_i) represents the degree of semantic discontinuity, and λ is a number between 0 and 1.
22. The memory of claim 17, wherein the determining the one or more insertion points includes: constructing a linear combination curve of one or more degrees of overall discontinuity and one or more degrees of attractiveness; and determining the one or more insertion points based on one or more peaks on the linear combination curve.
23. A system, the system comprising: one or more processors; and memory allocated for storing a plurality of computer-executable instructions which are executable by the one or more processors, the computer-executable instructions comprising: instructions for parsing a first video into a plurality of shots, the plurality of shots includes one or more shot boundaries; instructions for computing a degree of discontinuity for each of the one or more shot boundaries, each degree of discontinuity being a measure of dissimilarity between a pair of shots that are adjacent to a corresponding shot boundary that is computed based on an average of a computed degree of content discontinuity and a computed degree of semantic discontinuity for a corresponding shot boundary, the degree of semantic discontinuity being computed based on at least one attribute of the corresponding shot boundary that is detected by one or more concept detectors; instructions for computing a degree of attractiveness for each of the one or more shot boundaries, each degree of attractiveness being an amount of viewer attention that a corresponding shot boundary attracts that is estimated based on applying one or more attention models to the corresponding shot boundary and combining results from the one or more attention models in accordance with at least one of a linear weighted relationship characterizing the one or more attention models or a non-linear increasing relationship characterizing the one or more attention models; instructions for determining one or more insertion points based on the degree of discontinuity and the degree of attractiveness of each shot boundary, the one or more insertion points being for inserting a second video into the first video; and instructions for inserting the second video at the one or more determined insertion points to form an integrated video stream.
24. The system of claim 23, further comprising instructions for providing the integrated video stream for playback, and assessing effectiveness of the one or more insertion points based on viewer feedback to a played integrated video stream.
25. A memory having computer-executable instructions that are executable to perform acts comprising: parsing a first video into a plurality of shots, the plurality of shots includes one or more shot boundaries; computing a degree of overall discontinuity for each of the one or more shot boundaries, each degree of discontinuity being a measure of dissimilarity between a pair of shots that are adjacent to a corresponding shot boundary; computing a degree of attractiveness for each of the one or more shot boundaries, each degree of attractiveness being an amount of viewer attention that a corresponding shot boundary attracts that is estimated based on applying one or more attention models to the corresponding shot boundary; determining one or more insertion points based on the degree of overall discontinuity and the degree of attractiveness of each shot boundary, the one or more insertion points being for inserting a second video into the first video, wherein determining the one or more insertion points comprises constructing a linear combination curve of one or more degrees of overall discontinuity and one or more degrees of attractiveness, and determining the one or more insertion points based on one or more peaks on the linear combination curve; and inserting the second video at the one or more determined insertion points to form an integrated video stream.